Review for NeurIPS paper: Modeling Task Effects on Meaning Representation in the Brain via Zero-Shot MEG Prediction

Neural Information Processing Systems

Summary and Contributions: This paper presents a re-analysis of the MEG experiment of Sudre et al. (2012), in which participants responded to a question about the meaning of an object concept word. The original Sudre et al. analysis focused on testing the predictive power of different perceptual and semantic feature models of the concept word for the MEG data. The current study focuses instead on the role of the task question that precedes the concept word, and in particular on whether and how the semantics of the task question modulates the subsequent processing and neural activity time-locked to the stimulus word. This is an interesting neurocognitive question: it sheds light on how lexical-semantic representation and access can be modulated by preceding context, and on how the timing of task-independent processing of the target concept word relates to the timing of processing that integrates that conceptual knowledge with the task requirements in order to produce a response. To analyze the data, the authors construct vector-based semantic models of both the concept words and the task questions, using human responses collected separately, in which participants rated the truth of the task questions for the concepts.
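The rating-based construction described above can be sketched in miniature: a concept's vector is its row of human truth ratings over the task questions, and a question's vector is the corresponding column. All concepts, questions, and values below are illustrative stand-ins, not the study's actual data.

```python
import math

# Hypothetical ratings: human-judged truth of each task question for each
# concept word (values are invented for illustration).
ratings = {
    "hammer": {"Is it edible?": 0.05, "Is it man-made?": 0.95, "Is it alive?": 0.02},
    "apple":  {"Is it edible?": 0.98, "Is it man-made?": 0.10, "Is it alive?": 0.30},
    "dog":    {"Is it edible?": 0.10, "Is it man-made?": 0.02, "Is it alive?": 0.99},
}
questions = ["Is it edible?", "Is it man-made?", "Is it alive?"]

def concept_vector(word):
    """A concept's vector is its row of truth ratings over all questions."""
    return [ratings[word][q] for q in questions]

def question_vector(question):
    """A question's vector is its column of ratings over all concepts."""
    return [ratings[w][question] for w in ratings]

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(round(cosine(concept_vector("apple"), concept_vector("dog")), 3))
```

With vectors in hand, the relatedness of any question-concept or concept-concept pair falls out of a single similarity computation, which is what makes such models usable as predictors for neural data.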


Bootstrap Your Own Context Length

Wang, Liang, Yang, Nan, Zhang, Xingxing, Huang, Xiaolong, Wei, Furu

arXiv.org Artificial Intelligence

We introduce a bootstrapping approach to train long-context language models by exploiting their short-context capabilities only. Our method utilizes a simple agent workflow to synthesize diverse long-context instruction tuning data, thereby eliminating the necessity for manual data collection and annotation. The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection, all of which are readily accessible within the open-source ecosystem. Subsequently, language models are fine-tuned using the synthesized data to extend their context lengths. In this manner, we effectively transfer the short-context capabilities of language models to long-context scenarios through a bootstrapping process. We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length to up to 1M tokens, achieving superior performance across various benchmarks.
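The three-component workflow the abstract names (a short-context language model, a text retriever, and a document collection) might look roughly like the following sketch. `retrieve` and `short_context_lm` are hypothetical toy stand-ins, not the authors' actual components.

```python
def retrieve(query, corpus, k=3):
    """Toy retriever: rank documents by shared-word overlap with the query."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def short_context_lm(prompt):
    """Stand-in for a short-context LM call; a real system would generate an
    instruction/answer pair grounded in the short prompt chunk."""
    return f"Q: What does this passage discuss? A: {prompt[:40]}..."

def synthesize_example(seed_doc, corpus, k=3):
    """Build one long-context instruction example from short-context pieces:
    retrieve related documents, concatenate them into a long input, and attach
    an instruction generated from a chunk the short-context model can handle."""
    related = retrieve(seed_doc, corpus, k=k)
    long_context = "\n\n".join(related)
    instruction = short_context_lm(seed_doc)
    return {"context": long_context, "instruction": instruction}

corpus = [
    "Transformers process tokens with self-attention.",
    "Retrieval augments language models with external text.",
    "Cooking pasta requires boiling water.",
]
example = synthesize_example(corpus[0], corpus)
print(example["instruction"])
```

The key property, as in the paper's workflow, is that no step ever requires a long-context model: long training inputs are assembled from short pieces, and supervision comes from short-context generations.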


CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Siegel, Zachary S., Kapoor, Sayash, Nadgir, Nitya, Stroebl, Benedikt, Narayanan, Arvind

arXiv.org Artificial Intelligence

AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.